Choropleth maps show geographical areas or regions that are coloured, shaded or patterned based on the data.
We see this often during election days.
Each of these maps shows data for the same event, but the impressions they convey are very different. Each faces two main problems.
# Try to comment each line
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(jtools)
setwd("~/Documents/ibs_course/BUS240/data")
load('election.rda')
names(election)
## [1] "state" "st" "fips" "total_vote" "vote_margin"
## [6] "winner" "party" "pct_margin" "r_points" "d_points"
## [11] "pct_clinton" "pct_trump" "pct_johnson" "pct_other" "clinton_vote"
## [16] "trump_vote" "johnson_vote" "other_vote" "ev_dem" "ev_rep"
## [21] "ev_oth" "census"
election %>% select(state, total_vote, r_points, pct_trump, party, census) %>%
sample_n(7)
## # A tibble: 7 × 6
## state total_vote r_points pct_trump party census
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Connecticut 1644920 -13.6 40.9 Democratic Northeast
## 2 Alaska 318608 14.7 51.3 Republican West
## 3 West Virginia 721233 41.7 67.8 Republican South
## 4 District of Columbia 311268 -86.8 4.09 Democratic South
## 5 Wisconsin 2976150 0.770 47.2 Republican Midwest
## 6 Hawaii 428937 -32.2 30.0 Democratic West
## 7 Pennsylvania 6166729 0.710 48.2 Republican Northeast
State-level vote totals and shares for the 2016 US Presidential election. The variables are as follows:
state. State name
total_vote. Total votes cast
r_points. Percentage point difference between Trump share and Clinton
pct_trump. Trump vote share
party. Winning party
census. Census region
Let’s recall the Cleveland plot
party_colors <- c("#2E74C0", "#CB454A")
# Try to comment each line
p0 <- ggplot(data = subset(election, st %nin% "DC"),
mapping = aes(x = r_points,
y = reorder(state, r_points),
color = party))
p1 <- p0 + geom_vline(xintercept = 0, color = "gray30") + geom_point(size = 2)
p2 <- p1 + scale_color_manual(values = party_colors)
p3 <- p2 + scale_x_continuous(breaks = c(-30, -20, -10, 0, 10, 20, 30, 40),
labels = c("30\n (Clinton)", "20", "10", "0",
"10", "20", "30", "40\n(Trump)"))
p3 + facet_wrap(~ census, ncol=2, scales="free_y") +
guides(color="none") + labs(x = "Point Margin", y = "") +
theme(axis.text=element_text(size=8))
Note that you don’t have to represent spatial data spatially. Of course, spatial representations can be beneficial and look quite professional. We need to think hard about what we’d gain from the spatial representations. It’d lose information a lot.
The first task in drawing a map is to get a data frame with the right information in it, and in the right order. We use maps library which provides some pre-drawn map data.
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
us_states <- map_data("state")
head(us_states)
## long lat group order region subregion
## 1 -87.46201 30.38968 1 1 alabama <NA>
## 2 -87.48493 30.37249 1 2 alabama <NA>
## 3 -87.52503 30.37249 1 3 alabama <NA>
## 4 -87.53076 30.33239 1 4 alabama <NA>
## 5 -87.57087 30.32665 1 5 alabama <NA>
## 6 -87.58806 30.32665 1 6 alabama <NA>
glimpse(us_states)
## Rows: 15,537
## Columns: 6
## $ long <dbl> -87.46201, -87.48493, -87.52503, -87.53076, -87.57087, -87.5…
## $ lat <dbl> 30.38968, 30.37249, 30.37249, 30.33239, 30.32665, 30.32665, …
## $ group <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ order <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ region <chr> "alabama", "alabama", "alabama", "alabama", "alabama", "alab…
## $ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
We can make a blank state map right away with this data, using geom_polygon()
p <- ggplot(data = us_states,
mapping = aes(x = long, y = lat, group = group))
p + geom_polygon(fill = "white", color = "black")
The map is plotted with latitude and longitude points, which are there as scale elements mapped to the x and y axes. A map is, after all, just a set of lines drawn in the right order on a grid.
p <- ggplot(data = us_states,
aes(x = long, y = lat,
group = group, fill = region))
p + geom_polygon(color = "gray90")
Tell R not to plot a legend.
p <- ggplot(data = us_states,
aes(x = long, y = lat,
group = group, fill = region))
p + geom_polygon(color = "gray90") + guides(fill = 'none')
Thin the lines to make the state borders
p <- ggplot(data = us_states,
aes(x = long, y = lat,
group = group, fill = region))
p + geom_polygon(color = "gray90", size = .1) + guides(fill = 'none')
Techniques for map projection are a fascinating world of their own, but for now, just remember we can transform the default projection used by geom_polygon(), via the coord_map() function. You’ll remember that we said that projection onto a coordinate system is a necessary part of the plotting process for any data.
Normally it is left implicit, and we use Mercator projection, which looks like a 2 X 2 panel.
The Albers projection requires two latitude parameters, lat0 and lat1. We give them their conventional values for a US map here.
# play around lat0 and lat1 values
p + geom_polygon(color = "gray90", size = 0.1) +
coord_map(projection = "albers", lat0 = 39, lat1 = 45) +
guides(fill = "none")
So what are we doing? We have a state-level map.
Now we need to get our own data on to the map. Remember, underneath that map is just a big data frame specifying a large number of lines that need to be drawn. We have to merge our data with that data frame. Let’s go back to election data
head(election)
## # A tibble: 6 × 22
## state st fips total…¹ vote_…² winner party pct_m…³ r_poi…⁴ d_poi…⁵ pct_c…⁶
## <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Alab… AL 1 2.12e6 588708 Trump Repu… 0.277 27.7 -27.7 34.4
## 2 Alas… AK 2 3.19e5 46933 Trump Repu… 0.147 14.7 -14.7 36.6
## 3 Ariz… AZ 4 2.60e6 91234 Trump Repu… 0.035 3.5 -3.5 44.6
## 4 Arka… AR 5 1.13e6 304378 Trump Repu… 0.269 26.9 -26.9 33.6
## 5 Cali… CA 6 1.42e7 4269978 Clint… Demo… 0.300 -30.0 30.0 61.5
## 6 Colo… CO 8 2.78e6 136386 Clint… Demo… 0.0491 -4.91 4.91 48.2
## # … with 11 more variables: pct_trump <dbl>, pct_johnson <dbl>,
## # pct_other <dbl>, clinton_vote <dbl>, trump_vote <dbl>, johnson_vote <dbl>,
## # other_vote <dbl>, ev_dem <dbl>, ev_rep <dbl>, ev_oth <dbl>, census <chr>,
## # and abbreviated variable names ¹total_vote, ²vote_margin, ³pct_margin,
## # ⁴r_points, ⁵d_points, ⁶pct_clinton
## # ℹ Use `colnames()` to see all variable names
print("See the values in the state column")
## [1] "See the values in the state column"
head(election$state)
## [1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
## [6] "Colorado"
Interpret the following codes:
election$region <- tolower(election$state)
us_states_elec <- left_join(us_states, election)
## Joining, by = "region"
Now we have a merged data.
head(us_states_elec)
## long lat group order region subregion state st fips total_vote
## 1 -87.46201 30.38968 1 1 alabama <NA> Alabama AL 1 2123372
## 2 -87.48493 30.37249 1 2 alabama <NA> Alabama AL 1 2123372
## 3 -87.52503 30.37249 1 3 alabama <NA> Alabama AL 1 2123372
## 4 -87.53076 30.33239 1 4 alabama <NA> Alabama AL 1 2123372
## 5 -87.57087 30.32665 1 5 alabama <NA> Alabama AL 1 2123372
## 6 -87.58806 30.32665 1 6 alabama <NA> Alabama AL 1 2123372
## vote_margin winner party pct_margin r_points d_points pct_clinton
## 1 588708 Trump Republican 0.2773 27.72 -27.72 34.36
## 2 588708 Trump Republican 0.2773 27.72 -27.72 34.36
## 3 588708 Trump Republican 0.2773 27.72 -27.72 34.36
## 4 588708 Trump Republican 0.2773 27.72 -27.72 34.36
## 5 588708 Trump Republican 0.2773 27.72 -27.72 34.36
## 6 588708 Trump Republican 0.2773 27.72 -27.72 34.36
## pct_trump pct_johnson pct_other clinton_vote trump_vote johnson_vote
## 1 62.08 2.09 1.46 729547 1318255 44467
## 2 62.08 2.09 1.46 729547 1318255 44467
## 3 62.08 2.09 1.46 729547 1318255 44467
## 4 62.08 2.09 1.46 729547 1318255 44467
## 5 62.08 2.09 1.46 729547 1318255 44467
## 6 62.08 2.09 1.46 729547 1318255 44467
## other_vote ev_dem ev_rep ev_oth census
## 1 31103 0 9 0 South
## 2 31103 0 9 0 South
## 3 31103 0 9 0 South
## 4 31103 0 9 0 South
## 5 31103 0 9 0 South
## 6 31103 0 9 0 South
Now that everything is in one big data frame, we can plot it in a map.
p <- ggplot(data = us_states_elec,
aes(x = long, y = lat, group = group, fill = party))
p + geom_polygon(color = "gray90", size = 0.1) +
coord_map(projection = "albers", lat0 = 39, lat1 = 45)
Add some details:
p0 <- ggplot(data = us_states_elec,
mapping = aes(x = long, y = lat, group = group, fill = party))
p1 <- p0 + geom_polygon(color = "gray90", size = 0.1) +
coord_map(projection = "albers", lat0 = 39, lat1 = 45)
p2 <- p1 + scale_fill_manual(values = party_colors) +
labs(title = "Election Results 2016", fill = "none")
library(ggthemes)
p2 + theme_map()
With the map data frame in place, we can map other variables. Let’s try a continuous measure, such as the percentage of the vote received by Donald Trump, pct_trump
p0 <- ggplot(data = us_states_elec,
mapping = aes(x = long, y = lat, group = group, fill = pct_trump))
# The default color used in the p1 object is blue.
p1 <- p0 + geom_polygon(color = "gray90", size = 0.1) +
coord_map(projection = "albers", lat0 = 39, lat1 = 45)
p1 + labs(title = "Trump vote") + theme_map() + labs(fill = "Percent")
Want to change the direction of gradation
p2 <- p1 + scale_fill_gradient(low = "white", high = "#CB454A") +
labs(title = "Trump vote")
p2 + theme_map() + labs(fill = "Percent")
For election results, we might prefer a gradient that diverges from a midpoint. The scale_gradient2() function gives us a blue-red spectrum that passes through white by default.
p0 <- ggplot(data = us_states_elec,
mapping = aes(x = long, y = lat, group = group, fill = d_points))
p1 <- p0 + geom_polygon(color = "gray90", size = 0.1) +
coord_map(projection = "albers", lat0 = 39, lat1 = 45)
p2 <- p1 + scale_fill_gradient2() + labs(title = "Winning margins")
p2 + theme_map() + labs(fill = "Percent")
Purple America map
p3 <- p1 + scale_fill_gradient2(low = "red", mid = scales::muted("purple"),
high = "blue", breaks = c(-25, 0, 25, 50, 75)) +
labs(title = "Winning margins")
p3 + theme_map() + labs(fill = "Percent")
Why does it look so blueish? This is because Washington DC is included
in the data, and hence the scale. Even though it is barely visible on
the map, DC has by far the highest points margin in favor of the
Democrats of any unit of observation in the data. If we omit it, we’ll
see that our scale shifts in a way that does not just affect the top of
the blue end, but re-centers the whole gradient and makes the red side
more vivid as a result.
p0 <- ggplot(data = subset(us_states_elec,
region %nin% "district of columbia"),
aes(x = long, y = lat, group = group, fill = d_points))
p1 <- p0 + geom_polygon(color = "gray90", size = 0.1) +
coord_map(projection = "albers", lat0 = 39, lat1 = 45)
p2 <- p1 + scale_fill_gradient2(low = "red",
mid = scales::muted("purple"),
high = "blue") +
labs(title = "Winning margins")
p2 + theme_map() + labs(fill = "Percent")
setwd("~/Documents/ibs_course/BUS240/data")
load('county_map.rda')
load('county_data.rda')
glimpse(county_map)
## Rows: 191,382
## Columns: 7
## $ long <dbl> 1225889, 1235324, 1244873, 1244129, 1272010, 1276797, 1273832, 1…
## $ lat <dbl> -1275020, -1274008, -1272331, -1267515, -1262889, -1295514, -129…
## $ order <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1…
## $ hole <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ piece <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ group <fct> 0500000US01001.1, 0500000US01001.1, 0500000US01001.1, 0500000US0…
## $ id <chr> "01001", "01001", "01001", "01001", "01001", "01001", "01001", "…
county_map %>% sample_n(5)
## long lat order hole piece group id
## 1 1667128.6 -524167.3 183694 FALSE 1 0500000US54067.1 54067
## 2 1565944.7 -658729.0 68984 FALSE 1 0500000US21195.1 21195
## 3 101985.6 -1433176.3 157837 FALSE 1 0500000US48093.1 48093
## 4 548620.5 -1126258.7 23253 FALSE 1 0500000US05127.1 05127
## 5 1962495.9 -533602.4 172435 FALSE 1 0500000US51085.1 51085
glimpse(county_data)
## Rows: 3,195
## Columns: 32
## $ id <chr> "0", "01000", "01001", "01003", "01005", "01007", "01…
## $ name <chr> NA, "1", "Autauga County", "Baldwin County", "Barbour…
## $ state <fct> NA, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, A…
## $ census_region <fct> NA, South, South, South, South, South, South, South, …
## $ pop_dens <fct> "[ 50, 100)", "[ 50, 100)", "[ 50, 100)", "[…
## $ pop_dens4 <fct> "[ 45, 118)", "[ 45, 118)", "[ 45, 118)", "[118,71…
## $ pop_dens6 <fct> "[ 82, 215)", "[ 82, 215)", "[ 82, 215)", "[ 82, …
## $ pct_black <fct> "[10.0,15.0)", "[25.0,50.0)", "[15.0,25.0)", "[ 5.0,1…
## $ pop <int> 318857056, 4849377, 55395, 200111, 26887, 22506, 5771…
## $ female <dbl> 50.8, 51.5, 51.5, 51.2, 46.5, 46.0, 50.6, 45.2, 53.4,…
## $ white <dbl> 77.7, 69.8, 78.1, 87.3, 50.2, 76.3, 96.0, 27.2, 54.3,…
## $ black <dbl> 13.2, 26.6, 18.4, 9.5, 47.6, 22.1, 1.8, 69.9, 43.6, 2…
## $ travel_time <dbl> 25.5, 24.2, 26.2, 25.9, 24.6, 27.6, 33.9, 26.9, 24.0,…
## $ land_area <dbl> 3531905.43, 50645.33, 594.44, 1589.78, 884.88, 622.58…
## $ hh_income <int> 53046, 43253, 53682, 50221, 32911, 36447, 44145, 3203…
## $ su_gun4 <fct> NA, NA, "[11,54]", "[11,54]", "[ 5, 8)", "[11,54]", "…
## $ su_gun6 <fct> NA, NA, "[10,12)", "[10,12)", "[ 7, 8)", "[10,12)", "…
## $ fips <dbl> 0, 1000, 1001, 1003, 1005, 1007, 1009, 1011, 1013, 10…
## $ votes_dem_2016 <int> NA, NA, 5908, 18409, 4848, 1874, 2150, 3530, 3716, 13…
## $ votes_gop_2016 <int> NA, NA, 18110, 72780, 5431, 6733, 22808, 1139, 4891, …
## $ total_votes_2016 <int> NA, NA, 24661, 94090, 10390, 8748, 25384, 4701, 8685,…
## $ per_dem_2016 <dbl> NA, NA, 0.23956855, 0.19565310, 0.46660250, 0.2142203…
## $ per_gop_2016 <dbl> NA, NA, 0.7343579, 0.7735147, 0.5227141, 0.7696616, 0…
## $ diff_2016 <int> NA, NA, 12202, 54371, 583, 4859, 20658, 2391, 1175, 1…
## $ per_dem_2012 <dbl> NA, NA, 0.2657577, 0.2156657, 0.5125229, 0.2621857, 0…
## $ per_gop_2012 <dbl> NA, NA, 0.7263374, 0.7738975, 0.4833755, 0.7306638, 0…
## $ diff_2012 <int> NA, NA, 11012, 47443, 334, 3931, 17780, 2808, 714, 14…
## $ winner <chr> NA, NA, "Trump", "Trump", "Trump", "Trump", "Trump", …
## $ partywinner16 <chr> NA, NA, "Republican", "Republican", "Republican", "Re…
## $ winner12 <chr> NA, NA, "Romney", "Romney", "Obama", "Romney", "Romne…
## $ partywinner12 <chr> NA, NA, "Republican", "Republican", "Democrat", "Repu…
## $ flipped <chr> NA, NA, "No", "No", "Yes", "No", "No", "No", "No", "N…
county_data %>%
select(id, name, state, pop_dens, pct_black) %>%
sample_n(5)
## id name state pop_dens pct_black
## 1 48507 Zavala County TX [ 0, 10) [ 0.0, 2.0)
## 2 08013 Boulder County CO [ 100, 500) [ 0.0, 2.0)
## 3 21077 Gallatin County KY [ 50, 100) [ 0.0, 2.0)
## 4 36015 Chemung County NY [ 100, 500) [ 5.0,10.0)
## 5 04019 Pima County AZ [ 100, 500) [ 2.0, 5.0)
We merge the data frames using the shared FIPS id column:
county_full <- left_join(county_map, county_data, by = "id")
With the data merged, we can map the population density per square mile.
p <- ggplot(data = county_full,
mapping = aes(x = long, y = lat,
fill = pop_dens,
group = group))
p1 <- p + geom_polygon(color = "gray90", size = 0.05) + coord_equal()
p2 <- p1 + scale_fill_brewer(palette="Blues",
labels = c("0-10", "10-50", "50-100", "100-500",
"500-1,000", "1,000-5,000", ">5,000"))
p2 + labs(fill = "Population per\nsquare mile") +
theme_map() +
guides(fill = guide_legend(nrow = 1)) +
theme(legend.position = "bottom")
If you try out the p1 object you will see that ggplot produces a legible map, but by default it chooses an unordered categorical layout. This is because the pop_dens variable is not ordered. We could recode it so that R is aware of the ordering. Alternatively, we can manually supply the right sort of scale using the scale_fill_brewer() function, together with a nicer set of labels. We will learn more about this scale function in the next chapter. We also tweak how the legend is drawn using the guides() function to make sure each element of the key appears on the same row. Again, we will see this use of guides() in more detail in the next chapter. The use of coord_equal() makes sure that the relative scale of our map does not change even if we alter the overall dimensions of the plot.
We can now do exactly the same thing for our map of percent Black population by county. Once again, we specify a palette for the fill mapping using scale_fill_brewer(), this time choosing a different range of hues for the map.
p <- ggplot(data = county_full,
mapping = aes(x = long, y = lat, fill = pct_black,
group = group))
p1 <- p + geom_polygon(color = "gray90", size = 0.05) + coord_equal()
p2 <- p1 + scale_fill_brewer(palette="Greens")
p2 + labs(fill = "US Population, Percent Black") +
guides(fill = guide_legend(nrow = 1)) +
theme_map() + theme(legend.position = "bottom")
Even if our data is grouped into spatial units, it is always worth asking whether a map is the best way to present it.
Let’s take our state-level opiates data and redraw it as a time-series plot.
setwd("~/Documents/ibs_course/BUS240/data")
load('opiates.rda')
head(opiates, 5)
## # A tibble: 5 × 11
## year state fips deaths popul…¹ crude adjus…² adjus…³ region abbr divis…⁴
## <int> <chr> <int> <int> <int> <dbl> <dbl> <dbl> <ord> <chr> <chr>
## 1 1999 Alabama 1 37 4.43e6 0.8 0.8 0.1 South AL East S…
## 2 1999 Alaska 2 27 6.25e5 4.3 4 0.8 West AK Pacific
## 3 1999 Arizona 4 229 5.02e6 4.6 4.7 0.3 West AZ Mounta…
## 4 1999 Arkansas 5 28 2.65e6 1.1 1.1 0.2 South AR West S…
## 5 1999 Califor… 6 1474 3.35e7 4.4 4.5 0.1 West CA Pacific
## # … with abbreviated variable names ¹population, ²adjusted, ³adjusted_se,
## # ⁴division_name
Here are the list of variables:
year. Year
state. State name.
fips. State FIPS code.
deaths. Number of opiate-related deaths.
population. Population.
crude. Crude death rate.
adjusted. Adjusted death rate.
adjusted.se. Standard error of Adjusted death rate.
region. Census region. (Stored as an ordered factor.)
abbr. Abbreviated state name.
division_name. Census Division. (Character.)
We could just plot the trends for every state, as we did at the very beginning with the gapminder data.
p <- ggplot(data = opiates,
mapping = aes(x = year, y = adjusted,
group = state))
p + geom_line(color = "gray70")
## Warning: Removed 17 row(s) containing missing values (geom_path).
But fifty states is too many lines to keep track of at once.
library(ggrepel)
p0 <- ggplot(data = drop_na(opiates, division_name),
mapping = aes(x = year, y = adjusted))
p1 <- p0 + geom_line(color = "gray70",
mapping = aes(group = state))
p2 <- p1 + geom_smooth(mapping = aes(group = division_name),
se = FALSE)
p3 <- p2 + geom_text_repel(data = subset(opiates,
year == max(year) & abbr !="DC"),
mapping = aes(x = year, y = adjusted, label = abbr),
size = 1.8, segment.color = NA, nudge_x = 30) +
coord_cartesian(c(min(opiates$year),max(opiates$year)))
p3 + labs(x = "", y = "Rate per 100,000 population",
title = "State-Level Opiate Death Rates by Census Division, 1999-2014") +
facet_wrap(~ reorder(division_name, -adjusted, na.rm = TRUE), nrow = 3)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 27 rows containing non-finite values (stat_smooth).
## Warning: Removed 17 row(s) containing missing values (geom_path).